[SPARK-57782][INFRA][DOC] Make pages.yml reuse the doc image #56393

Draft
zhengruifeng wants to merge 3 commits into apache:master from zhengruifeng:pages-reuse-doc-image-dev1

@zhengruifeng zhengruifeng commented Jun 9, 2026

Contributor

What changes were proposed in this pull request?

Run the "GitHub Pages deployment" documentation job inside the prebuilt documentation container image ghcr.io/apache/spark/apache-spark-github-action-image-docs-cache:master-static, the same image that the documentation job in build_and_test.yml builds and runs in. That image is produced from dev/spark-test-image/docs/Dockerfile and published by build_infra_images_cache.yml.
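
For illustration, running a GitHub Actions job inside a container comes down to a job-level container key. The fragment below is a minimal sketch rather than the PR's actual diff: the job id docs, the runs-on value, and the checkout step are assumptions, and only the image reference is taken from the description above.

```yaml
# Sketch only: job id, runner label, and step are illustrative assumptions.
jobs:
  docs:
    runs-on: ubuntu-latest
    # Every step of this job runs inside the container, not directly on the runner.
    container:
      image: ghcr.io/apache/spark/apache-spark-github-action-image-docs-cache:master-static
    steps:
      - name: Checkout Spark repository
        uses: actions/checkout@v4
```

With a container configured this way, tools baked into the image (Python packages, Ruby, Pandoc) are available to later steps without separate install steps.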

As a result, the following steps now come from the image and are removed from pages.yml:

  • Install Python 3.11 and Install Python dependencies (the pinned Sphinx/pandas/grpcio pip list)
  • Install Ruby for documentation generation
  • Install Pandoc

Companion changes required to build inside a container, mirroring the documentation job in build_and_test.yml:

  • set LC_ALL/LANG to C.UTF-8
  • add a git config --global --add safe.directory ${GITHUB_WORKSPACE} step (the doc build invokes git as root inside the container)
  • run dev/free_disk_space_container to reclaim runner disk now that the image also occupies it
  • keep setup-java (Java 17) so JAVA_HOME is set for the Scala/SQL doc generation, and align the Bundler install with build_and_test.yml
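
The companion changes above could look roughly like the following job-level keys. This is a hedged sketch, not the PR's diff: the environment variables, the git config command, dev/free_disk_space_container, Java 17, and the jekyll build command come from this description, while the step names, action versions, the zulu distribution, and the unpinned bundle install are assumptions.

```yaml
# Sketch only: these keys sit under the job that declares the container.
env:
  LC_ALL: C.UTF-8
  LANG: C.UTF-8
steps:
  - name: Checkout Spark repository
    uses: actions/checkout@v4
  # The doc build calls git as root inside the container, so the
  # checked-out workspace must be marked as a safe directory.
  - name: Add GITHUB_WORKSPACE to git safe.directory
    run: git config --global --add safe.directory ${GITHUB_WORKSPACE}
  # Reclaim runner disk, since the pulled image also occupies it.
  - name: Free up disk space
    run: ./dev/free_disk_space_container
  # setup-java exports JAVA_HOME for the Scala/SQL doc generation.
  - name: Install Java 17
    uses: actions/setup-java@v4
    with:
      distribution: zulu  # assumption: the distribution is not stated in the description
      java-version: 17
  - name: Install Ruby gems for the docs
    run: |
      cd docs
      bundle install  # assumption: any Bundler version pin is omitted here
  - name: Build the documentation
    run: |
      cd docs
      SKIP_RDOC=1 bundle exec jekyll build
```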

Why are the changes needed?

pages.yml duplicated the documentation toolchain setup (a long pinned Python dependency list, Ruby, and Pandoc) that is already captured in dev/spark-test-image/docs/Dockerfile and published as a reusable image. Reusing that image keeps the documentation dependencies in a single source of truth, removes the duplicated install steps, and avoids reinstalling the toolchain on every run.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Validated by running the updated workflow end-to-end on a fork. Since the workflow only triggers on push to master in apache/spark, it was temporarily enabled for the PR branch (commit c1f16f3, reverted in 821a563, so the final diff is unchanged), with only the two Pages deploy steps skipped because Pages is not enabled on the fork.
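
As a rough sketch of what such a temporary validation change can look like (the real commits are c1f16f3 and 821a563, whose contents are not reproduced here): the job id, step names, action versions, and the exact guard expression below are assumptions, and only the changed keys are shown.

```yaml
# Sketch only: temporary fork-validation tweaks, to be reverted before merge.
on:
  push:
    branches:
      - master
      - pages-reuse-doc-image-dev1  # temporary: also trigger on the PR branch
jobs:
  docs:
    # Temporarily dropped job-level guard (hypothetical form):
    # if: github.repository == 'apache/spark'
    steps:
      # Pages is not enabled on the fork, so only these two steps are skipped there.
      - name: Setup Pages
        if: github.repository == 'apache/spark'
        uses: actions/configure-pages@v5
      - name: Deploy to GitHub Pages
        if: github.repository == 'apache/spark'
        uses: actions/deploy-pages@v4
```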

Successful run: https://github.com/zhengruifeng/spark/actions/runs/27413443557

The run pulled apache-spark-github-action-image-docs-cache:master-static, built the full documentation inside the container (SKIP_RDOC=1 bundle exec jekyll build against apache/spark@master sources, ~27 minutes), and uploaded the built site as a 114 MB github-pages artifact. Total run time (~30 minutes) is consistent with recent runs of the current workflow on apache/spark (~31-65 minutes), with the environment setup reduced to a ~50 second image pull plus a ~20 second bundle install. The skipped deploy steps (actions/configure-pages / actions/deploy-pages) are unchanged by this PR.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (model: claude-opus-4-8)

@zhengruifeng zhengruifeng changed the title from "[INFRA] Make pages.yml reuse the documentation image used in build_and_test" to "[INFRA] Make pages.yml reuse the documentation image" Jun 9, 2026
@zhengruifeng zhengruifeng changed the title from "[INFRA] Make pages.yml reuse the documentation image" to "[INFRA] Make pages.yml reuse the doc image" Jun 9, 2026
@zhengruifeng zhengruifeng changed the title from "[INFRA] Make pages.yml reuse the doc image" to "[SPARK-57341][INFRA] Make pages.yml reuse the doc image" Jun 11, 2026
@zhengruifeng zhengruifeng changed the title from "[SPARK-57341][INFRA] Make pages.yml reuse the doc image" to "[INFRA] Make pages.yml reuse the doc image" Jun 11, 2026
@zhengruifeng zhengruifeng force-pushed the pages-reuse-doc-image-dev1 branch from 512a92a to 7a82e1d June 12, 2026 11:35
…d_test

Run the documentation job inside the prebuilt documentation image (apache-spark-github-action-image-docs-cache:master-static) that build_and_test.yml already uses, dropping the redundant inline setup of the Python docs dependencies, Ruby, and Pandoc now provided by the image.
Temporarily run the GitHub Pages workflow on the fork to validate the container-based doc build: trigger on push to this branch, drop the apache/spark job guard, and skip the Pages configure/deploy steps on the fork. To be reverted before merge.
@zhengruifeng zhengruifeng force-pushed the pages-reuse-doc-image-dev1 branch from 821a563 to e2854b1 June 30, 2026 10:29
@zhengruifeng zhengruifeng changed the title from "[INFRA] Make pages.yml reuse the doc image" to "[SPARK-57782][INFRA] Make pages.yml reuse the doc image" Jun 30, 2026
@zhengruifeng zhengruifeng changed the title from "[SPARK-57782][INFRA] Make pages.yml reuse the doc image" to "[SPARK-57782][INFRA][DOC] Make pages.yml reuse the doc image" Jun 30, 2026